arXiv:1507.04436vl [stat.ML] 16Jul2015 


Joint Tensor Factorization and Outlying Slab Suppression with 

Applications 

^Xiao Fu, 'Kejun Huang, *Wing-Kin Ma, ^Nicholas D. Sidiropoulos, and ^Rasmus Bro 


^Department of Electrical and Computer Engineering, University of Minnesota, 
Minneapolis, 55455, MN, United States 
Email: (xfu,huang663,nikos)@umn.edu 

*Department of Electronic Engineering, The Chinese University of Hong Kong, 

Hong Kong 
Email: wkma@ieee.org 

^Department of Food Science, Faculty of Science, University of Copenhagen, 
Rolighedsvej 30, DK-1958 Frederiksberg C 
Email: rb@life.ku.dk 

July 17, 2015 


Abstract 

We consider factoring low-rank tensors in the presence of outlying slabs. This problem is 
important in practice, because data collected in many real-world applications, such as speech, 
fluorescence, and some social network data, fit this paradigm. Prior work tackles this problem 
by iteratively selecting a fixed number of slabs and fitting, a procedure which may not converge. 
We formulate this problem from a group-sparsity promoting point of view, and propose an alter¬ 
nating optimization framework to handle the corresponding £ p (0 < p < 1) minimization-based 
low-rank tensor factorization problem. The proposed algorithm features a similar per-iteration 
complexity as the plain trilinear alternating least squares (TALS) algorithm. Convergence of 
the proposed algorithm is also easy to analyze under the framework of alternating optimization 
and its variants. In addition, regularization and constraints can be easily incorporated to make 
use of a priori information on the latent loading factors. Simulations and real data experiments 
on blind speech separation, fluorescence data analysis, and social network mining are used to 
showcase the effectiveness of the proposed algorithm. 
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1 Introduction 


Factoring a tensor (i.e., a data set indexed by three or more indices) into rank-one components is 
a decomposition problem which is frequently referred to as parallel factor analysis (PARAFAC) or 
canonical decomposition (CANDECOMP), or canonical polyadic decomposition (CPD). Unlike two- 
way factor analysis (i.e., matrix factorizations), three- or higher-way low-rank tensor factorization 
reveals essentially unique factors under quite mild conditions, which is desirable when dealing 
with latent parameter estimation problems. Since the late 1990s, PARAFAC has been successfully 
applied to wireless communications for blindly estimating the spatial channels or the users’ code¬ 
division signatures mm; array processing for finding the directions-of-arrival of the emitters mm; 
chemometrics for resolving the spectra of chemical analytes [5]; blind speech and audio separation 
for estimating the mixing system mm; and, more recently, power spectra separation for cognitive 
radio [8], and big data mining for social group clustering [§.]. 

A high-order tensor can also be considered as a set of lower-order tensors. For example, a data 
cube (i.e., a three-way tensor) can be considered as a set of matrices (two-way tensors), obtained 
by fixing one index to a particular value. Each such piece of the original data, whose order has 
been reduced by one, will be called a slab. Slabs are usually physically meaningful in various 
applications. For example, in blind speech and audio separation, the received signals’ short-term 
covariance matrix, assumed constant within a short coherence interval and sometimes referred to 
as local covariance maun, can be considered as a slab of a three-way tensor; in fluorescence data 
spectroscopy, a measurement matrix that consists of emissions and excitations of the stimulated 
analytes is a slab [5]; and in array processing, the received raw signals at a subarray can be 
considered as a slab [3] - Due to this physical correspondence, however, strong data contamination 
or corruption frequently happens at the slab level (rather than element-wise). A typical example is 
blind speech separation - it has been observed that locally correlated speech sources may create local 
covariances (slabs) that do not obey the low-rank tensor model [IT]. Also, in chemometrics, e.g. in 
fluorescence spectroscopy, it is common that certain samples representing erratic measurements or 
samples of unusual constitution end up influencing the fitted model badly mm- 

Factoring a low-rank tensor in the presence of outlying slabs has been considered before. In 
the literature, the most closely related work may be [12]. There, an algorithm that iteratively 
selects a fixed number of slabs to fit with a low-rank tensor model was proposed. A main drawback 
with this algorithm is that it may not converge. Also, it is not easy to determine how many slabs 
should be selected to fit in advance. Similar insights are also seen in the analytic chemistry context; 
see [13H15j . In [16], the authors considered a different yet related scenario. There, a PARAFAC 
approach was proposed by changing the least squares-based optimization criterion to the tu-norm 
based fitting criterion, to make the low-rank decomposition robust against outlying elements. The 
resulting algorithms are alternating linear programming or alternating weighted median filtering 
(WMF). The algorithms in [16] do not need to pre-define the number of slabs to select for fitting, 
but they can be inefficient even when the problem size is medium. In addition, the i\ criterion is 
optimal in the maximum likelihood sense, when the noise follows the i.i.d. Laplacian distribution; 
but it is not specialized for (strong) slab-level outliers, as will be shown in the simulations. 

Contributions: In this work, we consider modeling and tackling the low-rank tensor decomposition 
problem with outlying slabs from a different perspective. Specifically, we formulate the problem 
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from a group-sparsity promoting viewpoint, and come up with an l v (0 < p < 1) fitting criterion. 
We propose to tackle this hard optimization problem using an alternating optimization strategy: 
by judiciously recasting the original problem into a more convenient form, we show that it can be 
tackled using a simple algorithm whose block updates admit closed-form solutions. This algorithm 
tends to iteratively select some clean slabs to fit with a PARAFAC model and downweight the 
outlying slabs at the same time. Reminiscent of classical robust fitting, the proposed algorithm does 
not assume knowledge of the number of clean slabs. Plus, drawing from existing theoretical results 
on alternating optimization m and its variants such as maximum block improvement (MBI) [18] . 
convergence of the proposed algorithm can be characterized. It is also worth noting that the 
proposed algorithm has almost the same per-iteration complexity as the trilinear alternating least 
squares (TALS) algorithm jT]-{3] , which is computationally much cheaper than the algorithms in [ 161 . 

Extensions to regularized and constrained cases are also considered in this work, since incorpo¬ 
rating a priori information on the loading factors is important in applied data analysis. Following 
the same alternating optimization framework, we propose to handle the subproblems by employ¬ 
ing an alternating direction method of multipliers [19] (ADMM)-based algorithm, which allows us 
to deal with different types of regularization and constraints of interest, under a unified update 
strategy. 

Besides simulations using synthetic data for verifying the ideas, we use several simulations and 
experiments with real data to showcase the effectiveness of the proposed approaches. First, the 
basic robust algorithm is applied on blind speech separation simulations, where real speech segments 
are mixed under realistic room acoustic impulse response scenarios. The separation performance of 
the proposed algorithm is shown to be superior to the earlier state-of-the-art. Then, the proposed 
algorithm is applied to a fluorescence data set; and finally to the ENRON e-mail corpus. Interesting 
and nicely interpretable results are obtained in both cases. 

Notation : We largely follow standard signal processing (and some Matlab) notational conventions, 
for convenience. Specifically, X € M ixJxR ( ] eno ^ es a three-way tensor, and X(i,j, fc) denotes the 
element that is indexed by ( i,j , k ); X(b b :)> X(b ji 0 and X(b:, k) denote the ith horizontal slab, the 
jth lateral slab, and the /cth frontal slab, respectively; X(i,:) and X(:, j) denotes the ith row and the 
jth column of the matrix X; T denotes the transpose operator; 1 denotes the MoorePenrose pseudo¬ 
inverse operator; ||x|| p = (YliLi |^i| p ) 1,/p for x € for 0 < p < oo; vec(X) = [X 7 (:, 1),... , X 7 (: 

, J)] T for X G &x and rank(X) denote the Kruskal rank and the rank of X, respectively; 

o, ©, <gi and © denote the outer product, the Hadmard product, the Kronecker product, and the 
Khatri-Rao product, respectively; Diag(x) denotes a diagonal matrix that holds the x\,... ,x m as 
the diagonal elements. 

2 Preliminaries on PARAFAC 

A simple description of the PARAFAC model is as follows. PARAFAC aims to represent a three-way 
tensor X G <C IxJxK using PARAFAC three latent factor matrices: 

R 

X^A(:,r)oB(:,r)oC(:,r), (1) 

r =1 

where A € C IxR , B G C JxR , C G C KxR , and R is called the rank of the PARAFAC model. Any 
tensor X € <C IxJxK can be exactly represented this way if a large-enough R < min (/J, JK,IK ) is 
used; but we are usually interested in using relatively small R. to capture the ‘principal components’ 
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of X- Equivalently, each element of the tensor can be represented as X(i, j, k) ~ A(i, r)B(j, r)C(k, r ). 

A three-way tensor is also a set of matrices, or, slabs, which are obtained by fixing one index. There 
are three types of slabs of a three-way tensor, namely, the horizontal slabs ({X(i,:, :)}^ =1 ), the lat¬ 
eral slabs ({X(:, 3i : )}/=i)j and the frontal slabs ({X(:, k)}^ =1 ). If the PARAFAC model in (fT|) 
holds exactly, each type of slab has a compact representation, i.e., 


Lateral slabs | X^ 
Frontal slabs | xj^ 
Horizontal slabs j X® 


X(:,j,:) = CD J (B)A r } J , 

> J= l 

X(:,:,fc) = AD fc (C)B T ) A , 

) k =1 

X(i,:,:) = BD l (A)C r | / , 

) i= 1 


where D,.(X) = Diag (X(r,:)). Fig. [T] gives a visual illustration of a three-way tensor X and its 
slabs. 



Figure 1: Slabs of a three-way tensor. 

Unlike matrix factorizations, which are in general non-unique, the PARAFAC decomposition 
has (essentially) unique solution under quite mild conditions. For example, Kruskal proved the 
following result for a real-valued low-rank tensor [20] : If 

&A T T kc P 2 R + 2, (2) 

then (A, B,C) are unique up to a common column permutation and scaling, i.e., X = 

, r) o B(:, r) o C(:, r) = £ r i, A(:, r) o B(:, r) o C(:, r) => A = AnA Q , B = BIIA b , C = CnA c , 
where II is a permutation matrix and A a , A&, A c , are full-rank diagonal matrices such that 
A a A{,A c = 1. If A is drawn from an absolutely continuous distribution over then k\ = 

rank(A) = min(J, R ) with probability one. It follows that if A, B, C are drawn this way, then the 
condition in ([2]) can be simplified: if 

min{/, R} + min{ J, R} + min {A, R} > 2 R + 2, (3) 
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then (A, B, C) are unique up to a common column permutation and scaling, with probability one. 
Notice that under the condition in ([2]) or 0, the loading matrices A,B,C need not be tall. This 
is advantageous in challenging application scenarios, e.g., mixing system identification when the 
system is under-determined mm- 

In practice, when modeling error and noise exist, it makes more sense to seek the best rank-i? 
approximation of a tensor rather than computing its exact rank factorization. To find such an 
approximation, the least squares criterion is commonly adopted: 


min 

A.B.C 


R 

X-J>(:,r)oB(:,r)oC(:,r) 

r =1 


2 


F 


( 4 ) 


The above problem is nonconvex, and thus could be very difficult to solve. In fact, recent research 
(21) showed that Problem (J4]) may even be ‘ill-posed’, meaning that the best rank-/? approximation 
of a tensor may not even exist. In practice, nevertheless, the formulation in 0 allows one to devise 
computationally affordable (albeit generally suboptimal) algorithms, and some of these algorithms 
have proven successful in various applications. To deal with the optimization problem in dH), a 
popular way is to make use of the matrix unfoldings of the tensor. Specifically, by vectorizing each 
type of slabs and treating them as columns of a matrix, we obtain the three matrix unfoldings, 
namely, X (i) = (A 0 C)B T , X (2) = (B © A)© 71 , and X <2) = (C © B)A r , where we have used 
the vectorization property of the Khatri-Rao product vec(XDiag(z)Y r ) = (Y 0 X)vec(z). Using 
the unfoldings, Problem [4] can be tackled by cyclically solving the following three least squares 
problems: 

B := argmin ||X (1 ^ — (A 0 C)B 7 ||^ (5a) 

B 

C := argmin ||X (2 ^ — (B 0 A)C T \\ 2 F (5b) 

c 

A := argmin ||X® — (C 0 B)A T ||^. (5c) 


The above updates yield the popular trilinear alternating least squares (TALS) algorithm [TJ[2]. 

Although quite a lot of different PARAFAC algorithms exist, e.g., [ T0U22H25] . TALS (and its 
close relatives) has been the workhorse of low-rank tensor decomposition for decades for several 
reasons: First, TALS can be easily implemented, since each iteration only involves relatively simple 
linear least squares subproblems. Second, it features monotone convergence of the cost function, 
without the need to tune (e.g., step-size) parameters to ensure this. Third, it has the flexibility to 
incorporate constraints and regularization on the loading factors under its alternating optimization 
framework, with a reasonable complexity increase. 


3 A Closer Look At Motivating Examples 

In many applications, some slabs of the collected tensor data are highly corrupted, for various 
reasons. In this section, we take a closer look at some pertinent examples that we have encountered 
in rather different fields. In all of them, corrupted slabs can throw off the analysis, producing 
inconsistent and hard to interpret PARAFAC models. 
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3.1 Blind Speech Separation 

It has been shown that PARAFAC can be applied to blind speech separation (BSS) to identify the 
mixing system E1IZ1- As a quick review, the BSS signal model is 

x(f) = As (t) + n(f), t = 1,2,... (6) 

where x(t) = [aq (t),... ,xi(t)] T € M 7 denotes the received signals by the I sensors at time t, 

A € M IxR denotes the mixing system, s(t) = [si(i),..., sn(t)] T € M 77 denotes the R speech 
sources (presumed to be uncorrelated), and n(t) = [ni(f),... ,nj(t)] T € M 7 denotes zero-mean i.i.d. 
Gaussian noise with variance cr 2 . To connect this model to the PARAFAC model, we calculate the 
local covariance of the received signals within time frame k by 

X(:,:, k) = E{x(f)x T (f)} - <r 2 I 

» AE{s(f)s T (f)}A r , t e[(k — 1 )L + 1, kL\, 

where a 2 represents the estimated noise variance and L denotes the time frame length. By assuming 
that the sources are uncorrelated, we see that the local covariance of the sources in frame k, i.e., 
for t € [(k — 1 )L + 1, kL\, 

E{s(t)s T (f)} = Diag([E|si(t)| 2 ,...,E|s i? (t)| 2 ]), 

is a diagonal matrix. Hence, if we let C (k ,:) = [E|si(t)| 2 ,..., E|s#(f)| 2 ] for t € [(k — 1 )L + 1, kL\, 
we see that X(:, k) = ADj.(C)A I is a frontal slab of a three-way tensor (with B = A), and 
thus PARAFAC can be applied to X to estimate the mixing system A. Using the estimated 
A, the individual source signals can be estimated. In the presence of reverberation, the mixing 
system model becomes convolutive (i.e., frequency-selective) instead of instantaneous. This is a 
more challenging scenario, which can again be tackled using PARAFAC in the frequency domain, 
see EEHm and references therein. 

A more subtle difficulty is that some speech sources exhibit (strong) short-term cross correla¬ 
tions, even though they are approximately uncorrelated over the long run. Consequently, the local 
covariances of the sources in some frames have significant off-diagonal elements, and the correspond¬ 
ing slabs deviate from the nominal model X(:,:, k) = ADj ; (C)A i . In such cases, directly applying 
standard PARAFAC algorithms may not yield satisfactory speech separation performance m- 

3.2 Fluorescence Spectroscopy 

Fluorescence excitation-emission measurements (EEMs) are used in many different fields such as 
skin analysis, fermentation monitoring, environmental, food, and clinical analysis M- A fluores¬ 
cence sample is obtained by using a beam of light that excites the electrons in molecules of certain 
compounds and causes them to emit light; the emission spectra are then measured at several 
excitation wavelengths. A fluorescence EEM sample can be represented by 

X(*,:,0 =BD ; (A)C t , 

where B(:, r) for r = 1,..., R corresponds to the spectral emission r, C(:, r) denotes the correspond¬ 
ing excitation values, and A(z, r) denotes the corresponding concentration (scaling) at sample i. 
By measuring multiple samples, a PARAFAC model can be formed, and each sample is a slab. 

Fluorescence data analysis has been recognized as a very successful example of applying PARAFAC 
algorithms to real-world data. At the same time, it has also been noticed that anomalous EEM 
samples occur frequently due to various reasons [IMS]. 
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3.3 Social Network Mining 

For some three-way social network data sets, every slab X(:, -,k) is a connected graph measured 
within time period k. For example, in the ENRON e-mail data set [26], X(i, j, k ) denotes the 
‘connection intensity’ of person i and person j at time period k (i.e., the number of e-mails sent 
by person i to person j within month k). Another example is the Amazon purchase data. There, 
X(i,j,k ) represents the amount of product j bought by person i in week k. For such data, each 
rank-one component of the PARAFAC model can be interpreted as the interaction pattern of a 
social group over time m- To be specific, consider 

R 

X(:,:,fc) ~ AD fc (C)B r = ]T C(k, r)A(:, r)(B(:, r)) T . 

r= 1 

Here, the nonzero elements in A (:, r) and B(:, r) create a clique (a subgraph) A(:, r)B T (:, r), which 
can be interpreted as a social group, and R corresponds to the number of social groups. Taking 
the ENRON e-mail data as an example, A(:,r)B i (:,r) is a group, where the people corresponding 
to the non-zero elements of A(:,r) have similar e-mail sending patterns to those corresponding to 
the non-zero elements of B(:,r). C (k,r) is a time-varying parameter of this group, which means 
that the e-mail sending pattern of this group is a rank-one matrix factor whose intensity (e-mail 
volume) varies with time. 

With this model, factoring the data box into its latent factors is equivalent to mining the 
underlying social groups, which finds applications in designing recommendation systems, analyzing 
ethic and cultural groups, and even detecting criminal organizations. However, the social network 
data sets are in general not following a generative signal model, which means that several slabs may 
have large modeling errors. As we will see later, some unexpected events (such as the ENRON crisis) 
might make the group e-mail patterns quite irregular during some period. The slabs measured in 
these irregular time intervals might need to be identified and somehow down-weighted when the 
objective is to analyze the normal interaction patterns, or to detect those anomalies. 

4 Problem Formulation 

Motivated by the examples in the previous section, we will focus on modeling, formulating, and 
solving the low-rank tensor decomposition problem in the presence of outlying slabs. Our main goal 
is an easily implemented optimization framework; practical considerations such as regularization, 
constraints, initialization and complexity will also be discussed. For presentation simplicity, we 
will assume that corruption happens in some horizontal slabs throughout the development of the 
algorithm; see Fig. [2j Algorithms dealing with corrupt lateral or frontal slabs can be obtained by 
simply permuting the modes of the tensor, by virtue of symmetry. 

To begin with, let us assume that some horizontal slabs have been corrupted by gross errors; 
i.e., we have 

x ( 3)= fBD l (A)C r + O l , iGAf, 

i \bd, ; (a)c t , ieN c , 

where M C {1,...,/} is the index set of the outlying slabs and M c = {1,— M. The gross 

( 3 ) 

error component Oj could be strong so that X) ' is far from the nominal ‘clean signal model’, i.e., 
X 4 (3) = BD. t (A)C T . Under the corruption model in (J7J), our first observation here is that there 
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corrupted slabs 



Figure 2: The corruption model: some horizontal slabs are outliers. 


may still be enough clean data to enable us to recover B and C intact. Thus, our idea begins with 
a formulation that guarantees the identihability of B and C under some conditions. 

We wish to fit the clean data slabs with a PARAFAC model. In practice, J\f is usually unknown, 
but its cardinality may be small relative to I. Hence, we address this problem from a group-sparsity 
promoting viewpoint. We formulate the problem as 


min 

A.B.C 




i= 1 


(C©B)(A(i,:)) 



( 8 ) 


where I(x) is defined as 


T{x) 


1 , i/O 
0, x = 0. 


The criterion tends to make X^(:, i) — fC0B)( A(i, :)) T = 0 for as many i’s as possible. Intuitively, 
if there are enough clean slabs to identify the underlying nominal PARAFAC model, solving the 
above optimization problem should identify B and C. The following result confirms this intuition. 


Claim 1 Assume that the elements of A are drawn from an absolutely continuous distribution over 
R /x ^, and likewise B and C are drawn from absolutely continuous distributions over and 

R KxR ; respectively. Define 


c := 2 R + 2 — min{ J, i?} — min{AT, R}, 
and suppose that c < min{|A/" c |, R}, and 


WA > P) 

Then, with probability one, the optimal B*, C* ; and A*(A r c , 0 that solve Problem (J8]) are B. C, 
and A(J\f c ,:) with a common column permutation and scaling; i.e., A*(A/" C ,:) = A(A/" C , :)IIA a , 
B* = BUA b , C* = CnA c , where II is a permutation matrix and A. a , A. b , A c , are full-rank 
diagonal matrices such that A a A b A c = I. 















The proof of Claim[T]can be found in Appendix[Aj Claim [T] helps us understand the fundamental 
limitation of the proposed criterion in (fHj) : Under the signal model in (j7jh if about one half of the 
horizontal slabs follow the clean signal model, we can still correctly identify the two loading factors 
B and C (and at least part of A). Solving Problem (jHJ) is very challenging though - both PARAFAC 
decomposition and group-sparsity maximization (cardinality minimization) are nonconvex problems 
on their own, so dU) is compounding two already challenging problems. In the next section, a 
more practical optimization surrogate will be employed to approximate Problem (|8|) . and a simple 
alternating optimization algorithm will be presented to tackle this surrogate optimization problem. 


5 Basic Algorithmic Framework 


To approximate Problem ([Hjl . we propose to employ the smoothed i p quasi-norm as our working 
objective; i.e., by replacing F^=]A( X *) by X^=i ( X 1 + e) p / 2 , we deal with the following surrogate: 


min 

A,B,C 



X (3) (:u)-(COB)(A(i,:)) T 



( 10 ) 


where 0 < p < 1 and e > 0. The idea comes from compressive sensing, where the quasi Iq norm 
is often approximated by the quasi £ p norm or £\ norm, since the latter two are computationally 
tractable and often yield practically good results [281I3T] . Here, e is a small smoothing parameter 
to keep the cost function in its continuously differentiable domain. 

The cost function in (HOI) can be manipulated according to the following lemma: 


Lemma 1 Assume 0 < p < 2, e > 0, and 4> p (w) := \ P W ) ? 2 + ew ■ Then, we have 

(x 2 + e) P// “ = min wx 2 + cb v (w ), 

v ’ w>0 y 

and the unique minimizer is 

w o P t = | {x 2 + e) — . (11) 

Proof: First, it can be seen that <p p (w) is strictly convex on its domain (i.e., the interior of 
w > 0), since its second order derivative is positive when w is positive, i.e., 


\/ 2 (t> p {w) = 


4 

P(P ~ 2) 



> 0 . 


Therefore, 


min wx 2 + 4> p (w) 

w> 0 


( 12 ) 


admits a unique optimal solution u; opt = (p/2)(x 2 + e)( p ~ 2 )/ 2 , which can be obtained by simply 
checking the first order optimality condition. Substituting w Qpt back into the cost of (fl2|) . the 
minimum cost is ( x 2 + e) p / 2 . □ 

By Lemma [U Problem H0|) can be re-expressed as the following problem: 


min 

A.B.C, 

{lO;>0} 


j>i X( 3 )(:,i)-(C©B)(A(1,:)) T +^M™i 


i =1 


i =1 


(13) 
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The structure of Problem m is nice: it allows us to optimize its cost with respect to (w.r.t.) 
the four blocks A,B,C, and {wi}j =1 in an alternating optimization fashion, fixing three blocks 
and updating one each timt@. As we will show next, each conditional optimization problem has a 
closed-form solution. 

First, the problem w.r.t. A is separable w.r.t. i. For each i, the problem w.r.t. A (i, :) is a 
simple least squares problem. Hence, the subproblem w.r.t. A admits the following closed-form 
solution: 

A = ^(C©B ) t X (3) ) T , 

which is the same as that in the plain TALS mm- Notice that in practice, we compute A by the 
following expression: 

A r = (C T C ® B r B)” 1 (C © B) t X (3) . 

In practice, the matrix inversion part and (C © B) T X® should be computed separately. The 
reasons are as follows. First, the inversion part, i.e., (C T C © B T B) -1 , is usually the inverse of 
a small (R-by-R) matrix. Second, the multiplication of a Khatri-Rao structured matrix and an 
unfolded tensor is a computationally expensive operation if /, J, K are large (specifically, this single 
step costs 2 RIJK flops), but fast algorithms are available when X is sparse [34H36] . j9j, [37] . 

To update B, we consider using the lateral slabs {CDj(B)A 7 }j =1 . From Problem (fl3l) . it can 
be readily seen that the ith column of {CDj(B)A r } is scaled by y/wi- Thus, the subproblem w.r.t. 
B can be written as 

J 2 

min V X%-CD,(B)A t W , 

B ■£—' II 3 ' F 

3 =1 

where W = Diag( v / uii,..., y/wj), or, in the following more compact form, 


min 

B 


(W © I)X (1) - ((WA) © C) B t 


The above is still a least squares problem. Therefore, the solution is simply 

B = (((WA) © C)* (W © I)X (1) ) T . 

In practice, the above solution can be written as follows: 

B r = ( WA © C) 1 (W © I)X (1) , (14a) 

= ((WA © C) t (WA © C)) _1 (WA © C) T (W © I)XW (14b) 

= (A t W 2 A © C T C) _1 (W 2 A © C) T X (1) , (14c) 


where we have used the property 


(Ui0V 1 ) T (U 2 ©V 2 ) = UfU 2 ®VfV 2 


to obtain (|14bl) . and 

(Ui©V 1 ) T (U 2 ©V 2 ) = U?’U 2 ©VfV 2 

X A similar auxiliary variable-based technique for splitting convex £ p norms (1 < p < 2) has appeared in [321133] , 
Lemma 1 can be considered as a nonconvex extension of the prior works in 13211331 . 
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to reach (I14cl) . Putting B in the form of (I14cl) is important. The reason is twofold: First, one does 
not have to actually compute and save (W <8)I)X (1 ^ since saving UK elements after each iteration 
is cumbersome when I,J,K are large (e.g., for I = J = I\ = 100, a million variables have to be 
saved in each iteration). Second, the efficient solvers for computing the product of a Khatri-Rao 
structured matrix and an unfolded tensor can be directly applied to (W 2 A 0 C) T X (1) . 

To update C, the rationale follows that of updating B. Specifically, as the zth row of each 
frontal slab is scaled by y/wi, we have can express the conditional problem w.r.t. C as 

K 2 

min wxf - WAD fc (C)B r , 

C k =1 F 


and the solution is also in closed form: 

C=((B0 WA) f (I 0 W)X (2 )) T . 


Similar to the B case, we can express C T as 

C T = (B t B ® A t W 2 A) _1 (B © W 2 A) t X (2) , 

and thus (B 2 B © A 2 W 2 A) 1 and (B 0 W 2 A) T X^ can be computed separately, if necessary in 
practice. 

The update w.r.t. follows Lemma[Q i.e., 


Wi 



(C0B)A r (i,:) 



Vi. 


Given these conditional updates, a simple strategy is to cyclically update A, B, C and {rci,..., wi}. 
The algorithm is summarized in Algorithm [TJ we will henceforth refer to it as Iteratively Reweighted 
Alternating Least Squares (IRALS), since w\,,wi can be interpreted as weights applied to the 
frontal slabs. From an algorithmic structure viewpoint, IRALS can be considered as an extension 
of the iteratively reweighted least squares (IRLS) algorithm [30j to tensor factorization. Since each 
partial minimization does not increase the value of the cost function and the function is lower 
bounded by zero, IRALS guarantees the convergence of the cost function of Problem (Ill- 


Remark 1 One may notice that we have not characterized the convergence of the solution sequence 
produced by IRALS yet. By some existing theories of alternating optimization, a stationary point 
for TALS and IRALS may be attained if the conditional objective function of every block is strictly 
convex and is continuously differentiable on the interior of the feasible set throughout all iterations 
[T7l Proposition 2.7.1]. In our context, this requires rank(B0A) = rank(COB) = rank(A0C) = R 
throughout all iterations, which is hard to check [38], Nevertheless, convergence to a stationary 
point of Problem (|10l) can be shown by employing some variants of alternating optimization, e.g., 
maximum block improvement (MBI) [18] . In this work, we adopt cyclic alternating optimization 
instead of MBI, for implementation simplicity and speed. Also, in practice, we are often interested 
in PARAFAC with regularization on the loading factors; in such cases, convergence to a stationary 
point of alternating optimization is usually not a problem any more [38] - see the next section for 
details. 
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Algorithm 1: IRALS 


input : X; , Bo, Co (initialization); and p £ (0,1]. 

1 B = Bo; 

2 C = C 0 ; 

3 W = I; 

4 repeat 

A := ((C T C © B t B)“ 1 (C © B) t X (3) ) T 
B := ((A T W 2 A © C T C) _1 (W 2 A 0 C) T xW) 
C := ((B t B © A t W 2 A) _1 (B 0 W 2 A) T X( 2 )) 


V 

Wi : = § 




+ e 


p-2 

2 


W 2 := Diag(u>i,... ,wr); 

10 until some stopping criterion is satisfied ; 

output: B, C. 


Vi; 


Remark 2 Until now, we have been dealing with the problem of interest (i.e., Problem (1101) 1 
indirectly. It is interesting to consider the relationship between the solutions of our working problem, 
i.e., Problem (j 13 1) . and Problem (1101) . It can be shown that 

Claim 2 Assume that (A*, B*, C*, is a stationary point of Problem (fT3l) . Then, (A*,B*,C*) 

is also a stationary point of Problem (fTUj) . 

The proof of Claim [2] can be found in Appendix IBl The key step is to invoke the uniqueness of the 
subproblem w.r.t. {w^ following Lemma Q] and marginalize it. By this claim, we see that dealing 
with Problem (1131) can yield a stationary point of Problem (1101) . whenever a limit point is reached. 

6 Extension: Constrained and Regularized Robust Tensor Factor¬ 
ization 

In this section, we consider practical extensions of IRALS, namely, constrained and regularized 
optimization. 

6.1 Adding Constraints and Regularization 

In data analytics, constrained and regularized low-rank tensor factorization often makes a lot of 
sense, since combining different types of a priori information may help find interpretable factors 
when modeling error and noise exist. Hence, there are many cases in which we are interested in 
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solving the following problem: 


min 

A,B,C 

S.t. 



X ( 3 ) (:,i)-(C©B)A T (z,:) 



+ A a /(A) + A b < ? (B) + A c / i (C), 
AgA,BgB, CeC, 


( 15 ) 


where A a , \b and A c are nonnegative regularization parameters, /(A), 5 (B) and g( C) are appro¬ 
priate regularization functions, and A, B and C represent (hard) constraints on the loading factors. 

In many cases, the constraints of interest include nonnegativity of the loadings, stemming 
from physical, chemical, or modeling considerations - e.g., concentrations, spectra, and e-mail 
counts are all nonnegative, and nonnegativity of the latent factors is important in social network 
mining m and in fluorescence spectroscopy m • More general ‘box’ constraints of type ai < 
A(i,r) < ah may also be appropriate, e.g., when we also have prior knowledge on the maximum 
possible concentration. 

Soft constraints may also be of interest, and these can be represented using appropriate regu¬ 
larization terms. If we know that the columns of B should be smooth, for example, we can employ 
the regularization 5 (B) = ||TB ||fi, where [39] 


1 -2 1 0 ••• 
0 1-210 


0 1-21 


(16) 


If we know that the loading factors are sparse, we can use || • ||i or any other sparsity-promoting 
function for regularization. As alluded to in Remark |T] adding regularization often brings side- 
benefits in terms of accelerating convergence, avoiding swamps, and attaining a stationary point 
[3811401 . For example, by adding the minimum-norm regularization ||A||^, ||B||^, and ||C||^, it 
can be easily seen that each block always has a unique solution, and thus a stationary point of 
Problem ([1511 can be attained by alternating optimization, whenever a limit point exists. Given the 
overall objective (1150 . and by Lemma [T] we may consider the equivalent reformulation 


min V —^ X ( 3 ) (:,i)-(C 0 B)A r (i,:; 
2=1 


+ — + A a /(A) + Afc</(B) + \ c h(C), 


2=1 


s.t. A € A, B £ B, CeC, 

Wi > 0, i = 1, 


The subproblem w.r.t. {wi} admits the same solution as before. In addition, the subproblems 
w.r.t. the loading factors are constrained and regularized least squares problems. To describe our 
treatment, we begin with the subproblem w.r.t. B: 


min 

B 

S.t. 


- (W®I)X« 
BgB. 


((WA) © C) B t 


2 + A 5 (B) 

r 
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To handle this problem, we propose the following alternating direction method of multipliers (ADMM) 
m based approach. We first rewrite the problem as 


1 

mm — 

B, Bi, b 2 2 


(W © I)XW - ((WA) © C) Bf 

+ Ap(B 2 ) + 1 b (B) 


s.t. B = Bi 


B = B 2 . 


(17) 


where 1^(X) is 0 for X € A and oo otherwise. ADMM solves the following augmented Lagrangian 
dual of Problem (fT71) [ TUI Chapter 3]: 


1 

max min — 
Ui, U 2 B, Bi, B 2 2 


(W © I)X (1) - ((WA) © C) Bf 

+\bg(B 2 ) + 1 b (B) 
+|||B-B 1 + U 1 ||f 

+|||B-B 2 + U 2 ||f, 


(18) 


where Ui and U 2 are the dual variables, and p > 0 is the stepsize parameter that is pre-specified. 
The standard ADMM updates for Problem (fTHl) are as follows [19] Chapter 3]: 


(W © I)X (1) - ((WA) © C) Bf 


1 

B i := argmin — 

& Bj 2 

+ |||B-B 1 + U 1 ||f 

B 2 := argmin \ b g(B 2 ) + ^||B - B 2 + U 2 ||f 

B 2 2 


B := argmin —I 
B B 2 


B — B 2 + U 2 ||p 


+ ^||B-B 1 + U 1 ||f+ 1 b (B), 


Ui :=Ui + B-Bi, 
U 2 := U 2 + B — B 2 , 


(19a) 

(19b) 


(19c) 

(19d) 

(19e) 


The proposed variable-splitting strategy brings several advantages. First, the problem w.r.t. 
Bi (i.e., Problem (jl9a.|) l is a least squares problem, whose solution is 

Bf := (A t W 2 A © C T C + pi)" 1 ((W 2 A 0 C) T X (1) + m) , 

where M = p(B + Ui). We see that the structure of (W 2 A © C) 1 X^ 1 ^ has been preserved, and 
thus efficient solvers for this matrix multiplication problem can be applied when the tensor is large 
and sparse mm, m, m- Second, the B 2 update is a proximal operator, which can be put in 
simple closed-form for many $(•)’s. Let us consider g( B 2 ) = f||TB 2 ||f as an example, which is 
often used for promoting smooth B. The B 2 update is then simply 

B 2 := (AfeT T T + pI) _1 (B + U 2 ). 
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Note that when T = I, this further reduces to B 2 = \^ +p (B + U 2 ) - which is useful to control 
the scaling of B. Also, if one wants to promote sparsity in B, several convex and nonconvex <?(•)’s 
that enable closed-form solution of Problem (I19bl) can be emoloved: see m- Third, the B update 
(Problem (|19c[) ) also has a simple form: 

B:= Q(B 2 -U 2 +B 1 -U 1 )^ , 

where (•)# is a projector to the set B. For many constraints B , this projection step is fairly simple. 
For example, if B = K + , the projection is 

B:= Q(B 2 -U 2 + Bi-Ui)^ , 

where (•)+ is an element-wise operator such that (x)+ = max{x, 0 }; see many other efficient 
projections in HS3- 

The ADMM updates w.r.t. C and A are similar; they are relegated to Appendix[Cj Overall, we 
solve the subproblems w.r.t. A, B, C and W cyclically as in the last section except that the former 
three are solved by ADMM. We should mention that the convergence properties of the described 
algorithm depend on the type of regularization and the constraints that are added. The reason is 
twofold. First, as previously mentioned, to ensure that every limit point of the solution sequence 
{A, B, C, W} is a stationary point of Problem (|15|h each subproblem w.r.t. A, B and C needs to 
be a convex problem admitting a unique solution, which depends on the type of regularization and 
the constraints. In addition, the convexity of a subproblem also affects the solution of ADMM - it 
guarantees that ADMM can attain the optimal solution of that subproblem. 

6.2 Initialization Approaches 

IRALS requires initial guesses of A, B and C. In practice, several initialization approaches can be 
considered: 

• First, random initialization is viable. Since the considered problem is nonconvex, using random 
initialization may require restarting the algorithm several times from random initial points to attain 
a good solution, but it also helps the algorithm to avoid ‘bad’ local minima. 

• Second, the loading factors estimated by algorithms that deal with the ^ 2 -norm fitting-based 
PARAFAC problem in (j4]) (or ^ 2 PARAFAC for simplicity), e.g., TALS, can be used as starting 
points. Algorithms tackling the variants of Problem Q with constraints and regularization on the 
loading factors can also be employed. This approach is effective when those algorithms are not 
totally thrown off by the outlying slabs. 

• Third, when / is larger than JK , one can first estimate an orthogonal basis U € M. KJxR such that 
7£(U) = 7Z(C © B), and then apply a Khatri-Rao subspace-based PARAFAC algorithm, such as 
those in [lQU22iS21S3], on the extracted U to get an initial guess of (B, C). Khatri-Rao subspace- 
based initialization is effective with large / since the procedure can ‘compress’ the original tensor 
substantialljil, and PARAFAC algorithms empirically work better when the data size is smaller. 
When there is no outlying slab, and when both C 0 B and A have full-column rank, a basis of 
1Z (C 0 B) can be obtained by applying singular value decomposition (SVD) on X®- Here, since 

2 To be specific, U = (C © B)©, for some © € can be considered as a compressed tensor which has only R 

slabs, whereas the original tensor has I slabs. If R -C 7, the compressed tensor has many fewer slabs. 


15 








there are outlying columns of X®, we can estimate U using robust SVD, which has been intensively 
studied in the recent literature; see, e.g., [331135] . 

7 Numerical Results 

In this section, we first use synthetic data to verify our ideas. Then, real-data experiments will be 
presented to show the effectiveness of the proposed algorithmic framework in practice. The algo¬ 
rithms presented in this section are all implemented in Matlab, and all simulations and experiments 
were carried out on a desktop computer with an i7 3.4GHz quad-core CPU and 8 GB RAM. 


7.1 Synthetic Data Simulations 

In this subsection, we generate the non-negative loading factors of three-way tensors following the 
exponential distribution with fi = 1. The outlier elements are uniformly distributed within zero 
and one, and then are scaled to satisfy the specified simulation conditions (see below). To quantify 
the corruption level, we define the signal-to-outlier ratio (SOR) as 


SOR(dB) 


10 log 


10 


(iA0£-=iE;=i££=i 


(i/M)E ie vllOi 


12 
If 


To benchmark our algorithm, we employ TALS for PARAFAC fitting (i.e., Problem (H|)) and 
the G-norm fitting based PARAFAC ( t\ PARAFAC) [16] with the alternating weighted median 
filtering realization. We fix p = 0.5 throughout this section; our experience is that that the results 
obtained for different p € [0.1,1] are qualitatively similar to those obtained for p = 0.5. IRALS is 
stopped when the absolute change of the objective value is less than 10 -8 or the number of iterations 
reaches 1000. For IRALS with constraints, we stop the ADMM algorithms for the subproblems 
when ||B — Bi ||2 + ||B — B 2 II 2 < 10” 3 following the guidelines in |T9|. IRALS and IRALS with 
constraints are initialized by plain TALS in this subsection. 

Table |T] shows the average mean-squared-errors (MSEs) of the estimated B and C by the 
algorithms under various SORs; the runtime performance is also presented in this table. The MSE 
of the estimated B is defined as 


MSE 


min 

7r(En, 

ci,...,cjE{±1} 


-it 

I< ^ 


3 = 1 


B(:,j) 

c B(:,t tj) 

l|B(:,j)ll2 

B(:,7Tj) 2 


where II is the set of all permutations of {1, 2,..., K}, and B(:, j) and B(:,j) are the ground truth 
of the jth column of B and the corresponding estimate, respectively; the same definition of MSE 
holds for C. We see that for R = 5, IRALS and IRALS with non-negativity constraints (denoted 
by ‘IRALS w./ nn’) both exhibit much lower MSEs compared to TALS and i\ PARAFAC. When 
R = 10, IRALS with non-negativity constraints gives the best MSE performance in general. In 
terms of runtime, the unconstrained IRALS and the IRALS with non-negativity constraints are 
both faster than l\ PARAFAC. Notably, unconstrained IRALS is more than 50 times faster than 
t\ PARAFAC in the presented simulations in this table. 

Table [2] shows the MSEs and runtimes versus the number of outlying slabs. In many cases of 
this simulation, t\ PARAFAC could not yield a reasonable result, and thus it was removed from the 
comparison. For the other three algorithms, we see that IRALS and IRALS with non-negativity 
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constraints can yield reasonable estimation of the loading factors even when the number of outlying 
slabs exceeds a half of the total number of slabs, but TALS gives very poor estimation in this case. 


Table 1: The average MSEs of the estimated B and C by the Algorithms under various SORs; 
(I, J, K) = (20, 20, 20); no. of outlying slabs = 6. 


R= 5 


Algorithm 

Measure 

SOR 

-10 

-5 

0 

5 

10 

TALS 

MSE (dB) 

-10.3345 

-13.9955 

-20.6065 

-28.3189 

-34.2078 

TIME (sec.) 

0.0655 

0.0635 

0.0579 

0.0534 

0.0534 

LI PARAFAC 

MSE (dB) 

-11.6272 

-19.6811 

-25.499 

-28.156 

-66.0477 

TIME (sec.) 

15.1952 

12.8005 

10.9255 

10.4323 

10.1826 

IRALS 

MSE (dB) 

-28.6011 

-46.3832 

-76.4109 

-129.469 

-127.115 

TIME (sec.) 

0.2576 

0.1857 

0.1496 

0.1452 

0.1423 

IRALS w./ mi 

MSE (dB) 

-28.5889 

-64.3808 

-73.2943 

-129.468 

-127.125 

TIME (sec.) 

8.2756 

5.7996 

4.6465 

4.2905 

4.1048 


R = 10 


Algorithm 

Measure 

SOR 

-10 

-5 

0 

5 

10 

TALS 

MSE (dB) 

-9.6438 

-12.9671 

-16.4158 

-24.1415 

-27.1827 

TIME (sec.) 

0.1955 

0.139 

0.1174 

0.0959 

0.0869 

LI PARAFAC 

MSE (dB) 

-7.5341 

-10.0898 

-13.4314 

-15.9167 

-17.7212 

TIME (sec.) 

79.3323 

67.8216 

52.7992 

44.2532 

36.9568 

IRALS 

MSE (dB) 

-19.4927 

-29.2948 

-39.594 

-38.0696 

-68.5139 

TIME (sec.) 

0.573 

0.5305 

0.4496 

0.3681 

0.305 

IRALS w./ nn 

MSE (dB) 

-22.5658 

-33.7498 

-54.5883 

-60.9308 

-67.8565 

TIME (sec.) 

16.3145 

16.4172 

14.061 

11.8115 

9.6166 


Table 2: The average MSEs of the estimated B and C by the Algorithms versus the number of 
outlying slabs; (I, J, K) = (20, 20, 20); SOR = OdB; R = 5. 


Algorithm 

Measure 

number of outlying slabs 

3 

5 

7 

9 

11 

TALS 

MSE (dB) 

-15.0081 

-11.4041 

-9.4724 

-8.4138 

-8.0331 

TIME (sec.) 

0.0592 

0.063 

0.0635 

0.0681 

0.0718 

IRALS 

MSE (dB) 

-30.4836 

-31.3318 

-24.839 

-23.3083 

-23.1527 

TIME (sec.) 

0.1723 

0.217 

0.2349 

0.2474 

0.2461 

IRALS w./ nn 

MSE (dB) 

-37.743 

-40.8854 

-40.8 

-40.4385 

-26.3071 

TIME (sec.) 

3.4883 

4.5614 

5.3182 

5.7222 

5.2961 


Fig. [3] presents the objective values of (1131) against the iterations when applying IRALS and 
IRALS with nonnegativity constraints with different initializations. This simulation is under the 
settings I = J = K = 20, R = 5, and \M\ = 6. Each curve is averaged from 100 trials. We see 
that, when using random initialization, the cost function of IRALS with nonnegativity constraints 
on the loading factors converges much faster than that of IRALS with no constraints. In addition, 
using the output of TALS helps the cost functions of both algorithms converge faster. Specifically, 
under such an initialization scheme, the objective values given by the algorithms both converge 
within 100 iterations. 


7.2 Blind Speech Separation 

In this subsection, we revisit the blind speech separation problem that has been mentioned in Sec. [3] 
We first show a simulation using instantaneously mixed speech sources, where the mixtures follow 
the signal model in Q . The sources are randomly picked from a database that consists of 23 speech 
segments; each source has a length of 3 second, and is sampled at a rate of 16KHz. We use 1 = 5 
sensors and R = 6 sources, which poses a challenging under-determined blind separation problem. 
Each time frame consists of 200 samples - this results in K = 239 time frames (slabs). Spatially 
and temporally white Gaussian noise is added to the received signals. Each local covariance of the 
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Figure 3: The convergence curves of the objective value when applying IRALS with different 
initializations and constraints. 

received signals (i.e., each slab of the PARAFAC model) is calculated using the local sample mean 
of x(i)(x(f)) T , and the noise variance is estimated by 

a 2 = min A min (X(:, k)) , 

k=l,...,K 

where A m ; n (X) denotes the smallest eigenvalue of X. The estimated noise variance is then removed 
from the data; see jlOllllj for details. The mixing system estimation problem can be formulated as 

K E 

min ]T (||X( : >:, k) - AD fc (C) A r ||^ + e) 5 , 

’ k=\ 

and we apply IRALS to the above by treating AD/ c (C)A 7 as AD^(C)B J . In this subsection, we 
use the Khatri-Rao subspace-based initialization as mentioned in Sec. 16.21 since the number of slabs 
(K in this case) is large. 

Fig. HI shows the average MSEs of the estimated mixing system obtained by several algorithms; 
the result is averaged from 100 independent trials. The benchmarked PARAFAC algorithm is 
SOBIUM [22], which is known as a state-of-the-art blind source separation algorithm for the under- 
determined case (i.e., I < J). We see that the proposed algorithm consistently yields around 15dB 
lower MSE than that of SOBIUM, which is a significant performance boost. This phenomenon 
verifies the existence of (significant) modeling error at some slabs, and also shows the effectiveness 
of our proposed algorithm. 

We also consider the convolutive mixture case, in which the signal model can be represented as 

^max 1 

x W= H (^) s (*-^)> 

e=o 

where x.(t) and s(t) are defined as before, and H(f') denotes the mixing system impulse response 
at time lag t. The convolutive mixture model is more realistic, since it captures the multi-path 
reverberation characteristics of real acoustic environments; but is also far more challenging to deal 
with, compared to the instantaneous mixture case. We build up the convolutive mixtures by setting 
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Figure 4: The MSEs of the estimated mixing systems obtained by SOBIUM and the proposed 
algorithm under various SNRs. 



Figure 5: The SIRs obtained by applying SOBIUM and the proposed algorithm to convolutive 
mixtures under various T&: i’s. 

up a simulated room with multiple paths between the speakers and receivers following the image 
method [45]. To separate the sources, we follow the frequency-domain approach MW - the basic 
idea is to transform the mixtures to the frequency domain, where the per-frequency (bin) mixtures 
follow an approximately instantaneous mixing model. Thus, PARAFAC algorithms can be applied 
at each frequency to obtain the source components at that frequency, and the time-domain sources 
can be obtained subsequently using certain post-processing steps, the most critical of which are 
permutation and scaling alignment across the different frequency bins. We measure the quality of 
the unmixed speech signals using the signal-to-interference ratio (SIR) criterion as in [6j[7]; higher 
SIR means better separation performance. Fig. [5] shows the results of using I = 4 sensors to 
separate J = 3 sources; the result is also averaged from 100 trials with randomly picked sources. 
We see that, under different reverberation conditions for the simulated room (a larger Tqq means a 
more severe multipath effect, thereby a more challenging environment for speech separation), the 
proposed algorithm consistently outperforms SOBIUM by around 2dB. 

7.3 Fluorescence Data Analysis 

In this subsection, we deal with a real fluorescence EEM data set - the Dorrit data that is available 
online at http://www.models.life.ku.dk/dorrit. Our working data set has 116 spectral emis- 
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data slab 5 



Figure 6: An outlying slab (left) and a relatively clean slab (right) of the Dorrit data. 

sions, 18 excitations, and 27 samples, which is a tensor with / = 27, J = 116 and K = 18. The 
Dorrit data set is known for containing some badly contaminated slabs, even after pre-processed by 
some automatic scattering removal algorithm m, and there are also some relatively clean samples 
in this data set; see Fig. [6) We formulate the problem of estimating the spectral emissions (B) and 
excitations (C) as 


I p 

mm c ]T (||X(i,:,:) - BD,(A)C r || J + e) 3 

’ ’ i =1 

+ -^a||A||^ + Afc||TB||p + A c ||TC||^ 

s.t. A > 0, B > 0, C > 0, 

where T is defined in (11611 with appropriate dimensions. We add smoothness regularization on B 
and C since we know that the emission and the excitation spectra are smooth in practice; also, 
non-negativity constraints are added to all three loading factors. We should point out that adding 
|| A|||i is important; otherwise, the scaling of B and C can be ‘absorbed’ by A, and the smoothness 
regularization (or, any other scaling-sensitive regularization) may not work. 

In this experiment, we set A& = A c = 10 and A a = 10 -2 and R = 4. Here, we use the l\ and £2 
PARAFAC algorithms with nonnegativity constraints as benchmarks, which are both implemented 
in the A-way toolbox [78] (available at http://www.models . life.ku.dk/source/nwaytoolbox/). 
The result of the nonnegativity-constrained £2 PARAFAC algorithm is used to initialize the pro¬ 
posed algorithm. The estimated B and C by the algorithms are shown in Fig. [71 We also provide 
the emission and excitation spectra obtained from certain ‘pure samples’ containing only a single 
compound. These pure samples are known from prior studies with this particular dataset, and thus 
the recovered spectra are believed to be close to the ground truth - see the row tagged as ‘from pure 
samples’ in Fig. [7] We see that the spectra estimated by the proposed algorithm are visually very 
similar to those measured from the pure samples. However, both of the nonnegativity-constrained 
t\ and £2 PARAFAC algorithms yield clearly worse results - for both of them, an estimated emission 
spectrum and an estimated excitation spectrum are highly inconsistent with the results measured 
from the pure samples. It is also interesting to observe the weights of the slabs given by the pro¬ 
posed algorithm in Fig. [HJ One can see that the algorithm automatically fully downweights slab 5, 
which is consistent with our observation (consistent with domain expert knowledge) that slab 5 is 
an extreme outlying sample (cf. Fig. [ 6 ]). This verifies the effectiveness of our algorithm for joint 
slab selection and model fitting. 
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Figure 7: The estimated emission and excitation curves obtained using the proposed algorithm, as 
well as nonnegativity-constrained 1 2 and l\ PARAFAC fitting. 



Figure 8: The normalized weights of the samples obtained via IRALS. 
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7.4 ENRON E-mail Data Mining 

In this subsection, we apply the proposed algorithm on the celebrated ENRON E-mail corpus. This 
data set contains the e-mail communications between 184 persons within 44 months. Specifically, 
X(z, j, k ) denotes the number of e-mails sent by person i to person j within month k. Many studies 
have been done for mining' the social grouos out of this data set |26II27II49|. In particular. m applied 
a sparsity-regularized and non-negativity-constrained PARAFAC algorithm on this data set, and 
some interesting (and interpretable) results have been obtained. In particular, the significant non¬ 
zero elements of A(:, r) usually correspond to persons with similar ‘social’ positions such as lawyers 
or executives. 

Here, we also aim at mining the social groups out of the ENRON data, while taking data for 
‘outlying months’ into consideration. It is well known that the ENRON company went through a 
criminal investigation and finally hied for bankruptcy. Hence, one may conjecture that the e-mail 
interaction patterns between the social groups might be irregular during the outbreak of the crisis. 
We fit the data using the following formulation: 

K E 

mm c £ (||X(:, k) - AD,(C)B T ||^ + e) 3 

’ ’ k =1 

A a /( A) + Ab||B|||i + A c ||C||p 
s.t. A > 0, B > 0, C > 0, 

where /(A) is a function that promotes sparsity following the insight in |2?j; ||B|||. and ||C||^ are 
added to avoid scaling / counter-scaling issues, as in the previous example. Notice that here we use 
an aggressive sparsity promoting function /(A) from [41], which itself cannot be put in closed form 
- notwithstanding, the proximal operator of /(A) can be written in closed-form, and thus is easy 
to incorporate into our ADMM framework. We fit the ENRON data with R = 5 as in 123, and set 
A a = 6.5 x IIP 2 , At = A c = IIP 3 . The same pre-processing as in |27[[49j is applied to the non-zero 
data to compress the dynamic range; i.e., all the non-zero raw data elements are transformed by 
an element-wise mapping x' = log 2 (x) + 1. As in the last subsection, the proposed algorithm is 
initialized by the nonnegativity-constrained I 2 PARAFAC algorithm. 

Table[3]shows the five social groups mined from the data, corresponding to the non-zero elements 
in the five columns of A. We see that these five groups are quite clean, covering 73 (‘important’) 
persons out of 184 in total. More interestingly, the algorithm automatically downweights the slabs 
corresponding to the period when the company was having a crisis - see Fig. [9] This verifies our 
guess: The interaction pattern during this particular period is not regular, and downweighting 
these slabs can give us more clean social groups. 

8 Conclusion 

In this work, we considered the problem of low-rank tensor decomposition in the presence of outlying 
slabs. Several practical motivating applications have been introduced. A conjugate augmented 
optimization framework has been proposed to deal with the formulated l p minimization-based 
factorization problem. The proposed algorithm features similar complexity as the classic TALS 
algorithm that is not robust to outlying slabs. Regularized and constrained optimization has 
also been considered by employing an ADMM update scheme. Simulations using synthetic data 
and experiments using real data have shown that the proposed approach is promising in different 
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Figure 9: The normalized weights obtained by the proposed algorithm when applied on the ENRON 
e-mail data. 


Table 3: Mining the ENRON E-mail corpus using the proposed algorithm. 


cluster 1 (Legal; 16 persons) 

cluster 2 (Excecutive; 18 persons) 

cluster 3 (Executive; 25 persons) 

Brenda Whitehead, N/A 

Dan Hyvl, N/A 

Debra Perlingiere, Legal Specialist ENA Legal 

Elizabeth Sager, VP and Asst Legal Counsel ENA Legal 

Jeff Hodge, Asst General Counsel ENA Legal 

Kay Mann, Lawyer 

Louise Kitchen, President Enron Online 

Marie Heard, Senior Legal Specialist ENA Legal 

Mark Haedicke, Managing Director ENA Legal 

Mark Taylor , Manager Financial Trading Group ENA Legal 
Richard Sanders, VP Enron Wholesale Services 

Sara Shackleton, Employee ENA Legal 

Stacy Dickson, Employee ENA Legal 

Stephanie Panus, Senior Legal Specialist ENA Legal 

Susan Bailey, Legal Assistant ENA Legal 

Tana Jones, Employee Financial Trading Group ENA Legal 

David Delainey, CEO ENA and Enron Energy Services 

Drew Fossum, VP Transwestern Pipeline Company (ETS) 

Elizabeth Sager, VP and Asst Legal Counsel ENA Legal 

James Steffes, VP Government Affairs 

Jeff Dasovich, Employee Government Relationship Executive 

John Lavorato, CEO Enron America 

Kay Mann, Lawyer 

Kevin Presto, VP East Power Trading 

Margaret Carson, Employee Corporate and Environmental Policy 

Mark Haedicke, Managing Director ENA Legal 

Philip Allen, VP West Desk Gas Trading 

Richard Sanders, VP Enron Wholesale Services 

Richard Shapiro , VP Regulatory Affairs 

Sally Beck, COO 

Shelley Corman, VP Regulatory Affairs 

Steven Kean, VP Chief of Staff 

Susan Scott, Employee Transwestern Pipeline Company (ETS) 

Vince Kaminski, Manager Risk Management Head 

Andy Zipper , VP Enron Online 

Jeffrey Shankman, President Enron Global Markets 

Barry Tycholiz, VP Marketing 

Richard Sanders, VP Enron Wholesale Services 

James Steffes, VP Government Affairs 

Mark Haedicke, Managing Director ENA Legal 

Greg Whalley, President 

Jeff Dasovich, Employee Government Relationship Executive 
Jeffery Skilling, CEO 

Vince Kaminski, Manager Risk Management Head 

Steven Kean, VP Chief of Staff 

Joannie Williamson, Executive Assistant 

John Arnold, VP Financial Enron Online 

John Lavorato, CEO Enron America 

Jonathan McKa, Director Canada Gas Trading 

Kenneth Lay, CEO 

Liz Taylor, Executive Assistant to Greg Whalley 

Louise Kitchen, President Enron Online 

Michelle Cash, N/A 

Mike McConnel, Executive VP Global Markets 

Kevin Presto, VP East Power Trading 

Richard Shapiro, VP Regulatory Affairs 

Rick Buy, Manager Chief Risk Management Officer 

Sally Beck, COO 

Hunter Shively, VP Central Desk Gas Trading 

cluser 4 (Trading; 12 persons) 

cluster 5 (Pipeline; 15 persons) 

Chris Dorland, Manager 

Eric Bas, Trader Texas Desk Gas Trading 

Philip Allen, Manager 

Kam Keiser, Employee Gas 

Mark Whitt, Director Marketing 

Martin Cuilla, Manager Central Desk Gas Trading 

Matthew Lenhart, Analyst West Desk Gas Trading 

Michael Grigsby, Director West Desk Gas Trading 

Monique Sanchez, Associate West Desk Gas Trader (EWS) 
Susan Scott, Employee Transwestern Pipeline Company (ETS) 
Jane Tholt, VP West Desk Gas Trading 

Philip Allen, VP West Desk Gas Trading 

Bill Rapp, N/A 

Darrell Schoolcraft, Employee Gas Control (ETS) 

Drew Fossum, VP Transwestern Pipeline Company (ETS) 

Kevin Hyatt, Director Asset Development TW Pipeline Business (ETS) 
Kimberly Watson, Employee Transwestern Pipeline Company (ETS) 
Lindy Donoho, Employee Transwestern Pipeline Company (ETS) 

Lynn Blair, Employee Northern Natural Gas Pipeline (ETS) 

Mark McConnell, Employee Transwestern Pipeline Company (ETS) 
Michelle Lokay, Admin. Asst. Transwestern Pipeline Company (ETS) 
Rod Hayslett, VP Also CFO and Treasurer 

Shelley Corman, VP Regulatory Affairs 

Stanley Horton, President Enron Gas Pipeline 

Susan Scott, Employee Transwestern Pipeline Company (ETS) 

Teb Lokey, Manager Regulatory Affairs 

Tracy Geaccone, Manager (ETS) 
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pertinent applications such as blind speech separation, fluorescence data spectroscopy, and social 
network mining. 


Appendix 


A Proof of Claim [T] 

Consider a feasible solution (A,B,C), where A (A f c ,:) = A (A F c , :)IIA a , B = BIIAft, and C = 
CIIA c . Consequently, it can be seen that for all i € A f c we have 


X^(:,i)-(COB)A(i ,:) 5 


= 0 . 


Hence, the optimal value of the cost function satisfies 


^min 0 I | A c | 0 


I — C 
2 


Now, we show that there is no other solution that leads to a smaller objective value. Suppose 
that there exists an index set S C J\[ c such that (some of) the slabs indexed by i € S (J A constitute 
a PARAFAC model whose loading matrices do not contain BIIA^ or CIIA c . We show that |<S| < c. 
In fact, if |<S| > c, then, with probability one, the slabs that belong to S can only be decomposed 
using B, C and A(«S,:) with a common column permutation and scaling. The reason is as follows. 
By the assumption that A is drawn from some absolutely continuous distribution, we see that 
k A ( 5 = min{|<S|, /?,} > minjc, f?} = c holds with probability one, and thus k A (g,:) + min{J, R} + 

minjA, R} > 2R + 2 holds almost surely. Hence, by the uniqueness condition mentioned in Q, 
the PARAFAC decomposition of X(<S,:) is essentially unique with probability one. Thus, it can 
be seen that if the solution to Problem (|HJ) does not satisfy B* = BIIA^, and C* = CIIA c , we 
must have 


> I- 


A JcSl >/- 


I + c I — c 


where we have used the fact that |A(J S\<(I + c)/2. 

It remains to show that A*(A C ,:) = A(A C , :)IIA a . In fact, given that the optimal solution 
satisfies B* = BIIAft, and C* = CIIA c , the optimal A* should be able to make 


^(I|X (3) (A) - (c*©b*)a*(v) t || 2 ) = 0, 


( 20 ) 


for as many as possible V s. For % € A c , we conclude A*(A C ,:) = A(A C , :)IIA a . The reason, 
again, lies in the uniqueness result in (|3|): Since |A C | > (/ + c)/2 > c, we have /ca(,V C! :) 0 c with 
probability one, since A is drawn from an absolutely continuous distribution over R IxR . Hence, 
^’AWcO + ^’B + fcc > 2R+2 holds with probability one. Consequently, the PARAFAC decomposition 
of X(A C ,:,:) is essentially unique. This implies A*(A C ,:) = A(A C , :)I I A (1 . 

B Proof of Claim [2] 

To relate the stationary points of Problem (11311 to the stationary points of Problem (HOI) , let us 
denote the cost functions of Problem (fTUl) and Problem (1T31) as 'ki(A, B, C) and T 2 (A, B, C, W), 
respectively. We see that 

'h 1 (A, B, C) = min T 2 (A,B,C ,{wi}j =1 ). 
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Let us consider (A*, B*, C*, {w*}[ =l ) as a stationary point of Problem (fl3l) . Following Lemma [Q 
a direct observation is that 


\Li(A*, B*, C*) 


4/ 2 (A*,B*,C*,{u;*}f = i), 


( 21 ) 


since \H 2 (A, B, C, (rcj}f =1 ) has a unique stationary point w.r.t. {wiY i=l on the interior of the non¬ 
negative orthant, which is the optimal solution w.r.t. {u;j}[ =1 . Hence, one can see that A*, B*, C* 
is also a stationary point of \Hi(A, B, C). In fact, taking A* as an example, we see that, following 

(ED, 


Tr (V a '3/ 2 (A*, B*, C*, {<}f =1 ) T (A - A*)) < 0 
=> Tr (V a ^i(A*, B*, C*) t (A - A*)) < 0, 


which implies that A is also a stationary point of Problem (|10l) . The same proof applies to B and 

C. 

C ADMM Updates w.r.t. C and A 

Now, let us consider the update of C: 


min 

c 

s.t. 


- (I®W)X< 2 )-(B 0 WA)C t 

C > 0. 


2 + \ c h(C) 

F 


( 22 ) 


By applying the same structure of ADMM, we come up with 

Cf := (B r B © A t W 2 A + pl) _1 x 
((B © W 2 A) t X ( 2) + p(C + V X ) T ) 

C 2 := argmin A c /i(C 2 ) + ^||Ci - C + V 2 ||^ 

C 2 2 

C:= Q( Ca-Va + Ci-Vx)) 

Vi := Vi + C - Ci 
V 2 := V 2 + C - C 2 . 

The update w.r.t. A is even simpler: 

A? := (C T C © B t B + pi ) _1 ((C © B) T X (3) + p(A + Zi) T ) 

A 2 := argminA a /(A 2 ) + ^||A - A x + Z 2 ||| 

A 2 2 

A = Q(Ai-Zi + A 2 -Z 2 )^ 

Zi := Zi + A — Ai 
Z 2 := Z 2 + A — A 2 . 


(23a) 

(23b) 

(23c) 

(23d) 

(23e) 

(24a) 

(24b) 
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(24c) 

(24d) 

(24e) 
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